Multidimensional biodata integration and relationship inference

ABSTRACT

This invention provides an advanced platform for the analysis of biological data that emphasizes pathway mapping and relationship inference based upon data acquired from multiple diverse sources. The platform employs a bioinformatic system that integrates data from the diverse sources, connecting related genes and proteins and inferring biological functions in the context of global cellular processes.

CROSS REFERENCE TO RELATED APPLICATIONS

The present application claims priority under 35 U.S.C. 119(e) from U.S. Ser. No. 60/298,689, filed Jun. 14, 2001, the disclosure of which is incorporated by reference herein.

BACKGROUND OF THE INVENTION

This invention relates to systems for the collection and manipulation of biodata from diverse sources and the processing of such biodata to identify potential therapeutic targets.

The human genome project, with its goal of complete genome sequencing, has examined three gigabytes of human genomic DNA and predicts that approximately 30,000 genes are resident in the human genome. However, identification and sequencing of a gene are but the first steps in its characterization. The challenge is to determine the function of the gene as well as its relationship to other genes. With this information, directed experimentation to identify genes that are likely targets for therapeutic intervention becomes feasible and, ultimately, the drug discovery timeline will be shortened.

Genes contain genetic information that is transcribed into messenger RNA and then translated into protein. Proteins play a critical role in cellular processes. Functional proteomics seeks to identify a protein's function and related pathway roles through large-scale, high-throughput experiments. Protein functional analysis systematically determines protein-protein interactions. Protein interactions mediate cellular signaling cascades that are not typically linear, but are more likely represented by a complex branched network. When unknown proteins interact with previously characterized proteins, information about their function and role in the same or related cellular process may be obtained.

Most commercially available bioinformatics systems perform functional analysis using a single information source such as a traditional relational database optimized for transactional database processing. Such systems do not integrate collections of data from various sources. Conversely, an intelligent system that integrates data derived from multiple sources would allow for the integration of data from various operational databases and, thus, enhance research efforts which focus on specific therapeutic targets.

SUMMARY OF THE INVENTION

This invention provides an advanced platform for the analysis of biological data that relies upon pathway mapping and relationship inferences drawn from data acquired from multiple diverse sources. The platform employs a bioinformatic system that integrates data from the diverse sources, connecting related genes and proteins and inferring biological functions in the context of global cellular processes.

The invention may be conceptually understood as having four primary components: data collection, data integration, data analysis/relationship inference, and inference presentation.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1. Data analysis and data warehouse system. Data from various sources is input into the operational relational database system (RDBMS), extracted and cleaned, then loaded into the data warehouse system, which organizes the data and infers relationships among the stored data, predicting missing attribute values for incoming data.

FIG. 2. Sample data view in a multidimensional database.

FIG. 3. Analysis of two characteristic matrices. The left panel matrix has the same cardinality in both dimensions. The matrix contains each gene's binding patterns. Domain analysis of each of the 391 genes resulted in a total of 849 domains as shown in the right panel. The negative normalized logarithm of the F-value was concatenated to the first matrix as shown in the right panel.

FIG. 4. A protein-protein interaction network in a human cell probed by the YTH system. Each gene is represented by a gray dot, and the edges connecting two genes represent specific interactions between the genes. For the apoptosis example, green represents the genes from the YTH matrix analysis, and blue represents additional genes from analyses of both YTH and domain information. As the network demonstrates, TNF-induced apoptosis pathway members (blue and green) are scattered in the YTH network, whereas domain information links them together.

FIG. 5. Genes identified in the apoptosis pathway of the network are represented in a hierarchical tree format. The genes closer to each other on a branch of the tree have a higher correlation based on specific interactions in the YTH system.

FIG. 6. Genes identified from the clustering analysis based on YTH and domain data. Genes in green are identified by YTH data; genes in blue are newly identified when domain data is added.

FIG. 7. Flowchart outlining the system of this invention.

DETAILED DESCRIPTION OF THE INVENTION

Embodiments of the present invention provide techniques for identifying potential therapeutic targets. In order to identify potential therapeutic targets, an embodiment of the present invention determines characteristics of genes and proteins based upon information available from various public and private information sources. FIG. 7 is a simplified high-level flowchart 100 depicting a method of determining characteristics of genes and/or proteins based upon information available from various public and private information sources according to an embodiment of the present invention. The method may be performed by a data processing system that may comprise a memory subsystem and one or more processors. For example, the processing depicted in flowchart 100 may be performed by software modules that are stored by the memory subsystem of the data processing system and are executed by one or more processors of the data processing system. The processing may also be performed by hardware modules coupled to the data processing system, or by a combination of software modules and hardware modules. It should be understood that flowchart 100 is merely illustrative of an embodiment incorporating the present invention and does not limit the scope of the invention as recited in the claims. One of ordinary skill in the art would recognize variations, modifications, and alternatives.

As depicted in flowchart 100, information from various information sources is gathered (step 102). As described below in further detail, the information sources may include public and private information sources (e.g., publicly accessible gene databases such as GenBank, databases storing micro-array information such as the Stanford Micro-array Depository, databases that store published information such as Medline, private information sources, experiments, and the like). The information may be collected manually or in an automated manner.

The information collected in step 102 is then stored in a format that facilitates analysis of the information (step 104). As described below in further detail, the information may be stored in one or more databases that are integrated using a data warehouse-based infrastructure. According to an embodiment of the present invention, the information is also stored or represented in the form of characteristic matrices. Each characteristic matrix may store and represent information for a particular dimension of data. For example, a first characteristic matrix may store information related to functional assays, a second characteristic matrix may store information related to protein-protein interactions, a third characteristic matrix may store information related to ontology mappings, a fourth characteristic matrix may store information related to fold recognition, and so on. Further information on the different characteristic matrices that may be used is described below.

The information stored in step 104 is then analyzed (step 106). Various different analysis techniques may be used. For example, according to an embodiment of the present invention, multivariate analysis techniques including clustering analysis techniques are used to analyze the information stored in step 104. According to an embodiment of the present invention, the information that is stored in databases and information that is represented by the characteristic matrices is analyzed. The characteristic matrices facilitate multidimensional analysis of the data.

Inferences, deductions, and/or conclusions are then drawn based upon the results of the analysis performed in step 106 (step 108). For example, according to an embodiment of the present invention, clustering analysis yields clusters of genes and proteins that are correlated with one another. Inferences can then be drawn from the clusters formed as a result of the clustering analysis. For example, genes and proteins with similar profiles based upon various biological experiments/observations that are clustered together are more likely to have similar cellular functions and be involved in the same biological pathways. As a result, characteristics of novel (or previously unknown) genes and functions of novel proteins can be inferred based upon characteristics and functions of the known genes or proteins in the same cluster. In this manner, various inferences, deductions, and conclusions can be drawn from the results of the analysis performed in step 106.
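By way of illustration only, the inference of step 108 might be sketched as follows. This is a minimal sketch that assumes clusters have already been produced by step 106 and that the inferred function is simply the most frequent known function in the cluster; the function name `infer_functions`, the majority-vote rule, and the gene labels are hypothetical and are not taken from the specification.

```python
from collections import Counter


def infer_functions(clusters, known_functions):
    """Propose, for each unannotated gene, the most common known function in its cluster."""
    inferred = {}
    for members in clusters:
        # Count the functions of the already characterized members of this cluster.
        counts = Counter(known_functions[g] for g in members if g in known_functions)
        if not counts:
            continue  # no characterized member, so nothing can be inferred
        top_function, _ = counts.most_common(1)[0]
        for gene in members:
            if gene not in known_functions:
                inferred[gene] = top_function
    return inferred


# Illustrative (hypothetical) clusters and annotations.
clusters = [
    ["known_1", "known_2", "novel_1"],
    ["known_3", "novel_2", "known_4"],
    ["novel_3"],
]
known_functions = {
    "known_1": "apoptosis",
    "known_2": "apoptosis",
    "known_3": "cell cycle",
    "known_4": "cell cycle",
}
inferred = infer_functions(clusters, known_functions)
```

A cluster with no characterized member, such as the third one here, yields no inference for its genes.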

The inferences, deductions, and/or conclusions drawn in step 108 may then be output to a user (step 110). Various different techniques may be used to output the results to the user. For example, according to an embodiment of the present invention, the information may be output to the user in response to a query received from the user. Various different user interfaces may be used to output the information to the user.

The processing performed in each step of flowchart 100 is described below in further detail.

The four primary components of the multidimensional biodata integration and relationship inference platform are described in detail below.

a) Definitions

The following definitions are set forth to illustrate and define the meaning and scope of the various terms used to describe the invention.

The term “therapeutic target” means any environment or molecule (often a gene or a protein) that is instrumental to a disease process, though not necessarily directly involved, with the intention of finding a way to regulate that environment's or molecule's activity for therapeutic purposes.

By the term “biodata” is meant, for the purpose of the specification and claims, any biological data compiled in one or more database(s) and/or data warehouse including, but not limited to, biological data related to molecular pathways, cellular processes, protein-protein interaction, protein structure, genetics, molecular biology, expression arrays, functional assays, and genomes.

The term “data warehouse” refers to a repository where data from multiple databases is brought together for more complex analysis. It is also a physical repository where relational data are specially organized to provide enterprise-wide, cleaned data in a standardized format.

The term “data mart” means a subset of a data warehouse where data relevant to a particular query is stored.

The term “archival biological data” means biological data that has been archived or compiled in one or more database(s) and/or data warehouse(s) and can be accessed by a user. Examples of archival biological data are domain analyses, ontology vocabulary mapping, fold recognition, gene sequences, expressed sequence tags (ESTs), single nucleotide polymorphisms (SNPs), biochemical functions, physiological roles and structure/function relationships.

A “multidimensional database” as used herein, is a database in which data is organized and summarized in multiple dimensions for easier comprehension. By performing queries, users can create customized slices of data by combining various fields or dimensions.

The term “compound library” refers to a large collection of compounds with different chemical properties or shapes, generated either by combinatorial chemistry or some other process or by collecting samples with interesting biological properties. This compound library can be screened for drug targets. For example, a lead compound is a potential drug candidate emerging from a screening process of a large library of compounds.

“High-throughput screening (HTS)” refers to rapid in vitro screening of large numbers of compound libraries (generally tens to thousands of compounds), using robotic screening assays.

A “protein array” or “protein microarray” refers to a multi-spot, metallic or polymeric device with surface chemistries used for affinity capture of proteins from complex biological samples. A “protein array analysis” refers to the use of a protein array in order to evaluate potential polypeptide(s) of interest. For example, using a protein microarray made up of several hundred antibodies it is possible to monitor alterations of protein levels in specific cells treated with various agents.

A “nucleic acid array” or “DNA microarray” as used herein means a device (e.g., glass slide) for studying how large numbers of nucleic acids (e.g., cDNA, genomic DNA, RNA, mRNA, siRNA, etc.) interact with each other and how a cell controls vast numbers of nucleic acids simultaneously. Tiny droplets containing nucleic acids are applied to slides and fluorescently labeled probes are allowed to bind to the complementary nucleic acid strands on the slides. The slides are scanned and the brightness of each fluorescent dot is measured. The brightness of the dot reveals the presence and quantity of a specific nucleic acid. A “nucleic acid array analysis” employs such arrays for the analysis of nucleic acid(s) of interest.

The term “functional gene screening” refers to the use of a biochemical assay in order to screen for a specific protein, which indicates that a specific gene is not merely present but active.

The term “expressed sequence tag (EST)” refers to a nucleic acid sequence made from cDNA which comprises a small part of a gene. An EST can be used to detect the gene by hybridizing the EST with part of the gene. The EST can be radioactively labeled in order to locate it in a larger segment of DNA. A “single nucleotide polymorphism (SNP)” refers to changes in a single base pair of a particular gene happening simultaneously in a population.

The term “structure/function relationship” refers to the relationship between the structure and organization of the gene and the function of the gene as it directs growth, development, physiological activities, and other life processes of the organism. It also refers to the structure and organization of the protein and function of the protein as cellular building block and/or participant in cellular processes and pathways.

The term “cleaned” means, for the purpose of the specification and claims, the process of making data that is being imported into a data warehouse more accurate by removing mistakes and inconsistencies.

The term “time variant data” refers to data whose accuracy is relevant to any one moment in time. Thus, the term “time variant database” refers to a database that contains, but is not limited to, such time variant data.

“Cluster analysis”, as used herein, refers to clustering, or grouping, of large data sets (e.g., biological data sets) on the basis of similarity criteria for appropriately scaled variables that represent the data of interest. Similarity criteria (distance based, associative, correlative, probabilistic) among the several clusters facilitate the recognition of patterns and reveal otherwise hidden structures.

An “unsupervised analysis” refers to a data analysis technique whereby a model is built without a well defined goal or prediction field. The systems are used for exploration and general data organization. An “unsupervised clustering analysis” is an example of an unsupervised analysis.

The term “algorithm” means a procedure used to solve a mathematical or computational problem or to address a data processing issue. In the latter sense, an algorithm is a set of step-by-step commands or instructions designed to reach a particular goal or a step-by-step search, where improvement is made in every step until the best solution is found.

A “neural net” or “artificial neural network” refers to computer technology that operates like a human brain, such that computers possess simultaneous memory storage and work with ambiguous information.

A “relational database” refers to a database in which data is stored in multiple tables. These tables then “relate” to one another to make up the entire database. Queries can be run to “join” these related tables together. An “operational database” comprises system-specific reference data and event data belonging to a transaction-update system. It may also contain system control data such as indicators, flags, and counters. The operational database is the source of data for the data warehouse. The data continually changes as updates are made, and reflects the current value of the last transaction.

b) Data Collection

Data may be obtained from both experimental and archival sources. The data sources are diverse and constitute the basic input for the system. Representative data may be obtained, inter alia, from gene functional assays, protein-protein interaction studies (e.g. yeast two-hybrid screening, proteomic chip or chromatographic analyses), ontology vocabulary mapping, fold recognition, nucleic acid array expression data, gene domain analysis, and proprietary, published or otherwise publicly accessible archived gene and protein sequences, ESTs, structure-function relationships, and related chemical, clinical and physical data. In a given embodiment, any number of data sources may be called upon to invoke the platform.

Functional assays. Cell cycle functional screening assays may be used to identify small fragments of genes or peptides that cause arrest at different cell cycle phases. Information associated with each assay includes experimental protocols, reagents, and raw data such as electrophoresis gel images.

Protein-protein interactions. To further characterize proteins demonstrating a desired functional assay phenotype, proteins interacting with each other may be identified through a variety of techniques, including, but not limited to, yeast two-hybrid (YTH) screening and proteomic chip and chromatographic analyses.

The yeast-based two-hybrid (YTH) system is a common method for large-scale experimental detection of protein-protein interactions. In YTH systems, the protein of interest (the bait) is fused to a fragment of a known DNA-binding protein such as GAL4, anchoring the bait to a colorimetric reporter gene. Potential interacting proteins (screened members expressed from a cDNA library) are then attached to a cognate transcriptional activating protein that can activate the reporter gene, producing an easily monitored color change in yeast cells. This color change results from the direct physical interaction between the proteins.

Mapping protein interactions through protein chip patterns is an alternative high-throughput methodology. In addition, 2D polyacrylamide gel electrophoresis (2D GEL) coupled with mass spectroscopy now provides proteomic fingerprints, digestion patterns of protein complexes in different cellular states.

LIMS (Laboratory Information Management Systems) is used to track the large-scale cloning process, sequencing, and downstream data generated from different functional assays, yeast two-hybrid screening, and proteomics experiments. Functional screens include, but are not limited to, cell cycle regulation, angiogenesis, T cell activation, B cell activation, IgE class switch, etc.

For example, the goal of the cell cycle functional assay is to identify genes that cause arrest at different cell cycle phases. The identified genes are then evaluated as targets for therapeutic intervention to slow tumor growth or tumorigenesis. The process may be described as follows. Cell tracker dye is applied to tumor cells that host different cDNA fragments. After several days of growth, the slowly replicating cells are sorted out using FACS (fluorescence-activated cell sorter). The cDNA that confers this phenotype is then extracted using RT-PCR (reverse transcriptase-polymerase chain reaction). Information associated with the whole process includes experimental protocols, reagents, gene sequences, and raw data (such as histograms or dot plots generated by FACS). Raw data results from each functional assay are then transferred to the data warehouse for storage.

LIMS is a task-based workflow system implemented in JAVA technology with a generic schema and open architecture. It supports various workflow patterns (fork, option, loop, merge), two types of nodes (entity node, bridge node), and exit/entry conditions. Tasks are assigned based on roles or inherited from the parent tasks. A message system is used to send notices to users for pipeline modification. Security management, transaction management and resource management are also implemented in the system. The scheduler can schedule certain tasks based on one-time execution or repeat execution. XML technology is used generally for the workflow descriptor and object distribution. The workflow deployment tool deploys the specific pipeline based on the descriptor. The front end of this system is browser based.

Data from public sources are gathered automatically by a web crawler (written in the PERL scripting language) or by periodically dispatched UNIX processes. Various parsers are developed to extract the information needed for downstream data warehouse storage. These data sources include: ontology classification, co-regulation value based on expression array, public sequences (EST, protein and nucleotide), and individual genomes from various species, as described in the following.

We use ncftp, scheduled by a Unix cron job that runs every week, to fetch the EST, proteins and nucleotides from NCBI at ftp://ftp.ncbi.nih.gov/blast/db. The raw data is stored on server disk. A LWP based Perl script takes three URLs (http://www.tigr.org, http://genome.ucsc.edu, http://www.fruitfly.org/) as its input and the output is parsed and stored as a text file. The following data sources are generated based on computation applied to the original data sources from the above: domain analysis, fold recognition, and protein links based on multiple genomes.

Published literature may also be used, either by manually extracting information or by using natural language processing (NLP), for incorporation into the database.

Ontology vocabulary mapping. An ontology provides a formal written description of a specific set of concepts and their relationships in a particular domain (P. D. Karp, An Ontology for Biological Function Based on Molecular Interactions, Bioinformatics, vol. 16, 2000, pp. 269-285). One ontology is based upon Stanford University's Gene Ontology (GO) Consortium (www.geneontology.org). The LWP based Perl script uses HTTP GET to fetch the three ontology files (component.ontology, process.ontology, function.ontology) from the ftp site (ftp://ftp.geneontology.org/pub/go/ontology/). The files are stored on the server disk and are parsed by the matrix generation program.

GO has three categories: molecular function, biological process, and cellular component. A gene product has one or more molecular functions and participates in one or more biological processes. The gene product might be a cellular component or it might be associated with one or more such components. Each element's ontology is represented on an acyclic directed graph. The nodes at the upper branches have more general characteristics, while end nodes have relatively specific attributes, including inheritance of parental characteristics.

Fold recognition. Although there are several hundred thousand proteins in the nonredundant protein database at the US National Center for Biotechnology Information (NCBI), it is estimated that there are only about 5,000 unique native 3D structures or folds. Most frequently occurring folds were determined experimentally. PROSPECT, a threading package developed at Oak Ridge National Lab, was used to predict novel gene folds (Y. Xu and D. Xu, Protein Threading Using Prospect: Design and Evaluation, Proteins, vol. 40, 2000, pp. 343-354). Threading searches use structure templates to find a query's best fit. PROSPECT has three components:

-   Libraries of representative 3D protein structures for use as templates, including protein chain (2,177 templates defined by the families of structurally similar proteins [FSSP] nonredundant set) and compact domains (771 domains defined by the distance-matrix-alignment [DALI] nonredundant domain library).
-   A knowledge-based energy function describing the fitness between the query sequence and potential templates.
-   A “divide-and-conquer” threading algorithm that searches for the lowest energy match among the possible alignments of a given query-template pair. The algorithm first aligns elements of the query sequence and the template, and then merges the partial results to form an optimal global alignment.

A neural network derives a criterion to estimate the predicted structure's confidence level. Typically, the criterion selects about five statistically significant hits for a query protein.

Coregulation from array expression. Nucleic acid arrays hybridize labeled RNA or DNA in solution to nucleic acid molecules attached at specific locations on high-density array surfaces. Hybridization of a sample to an array is a highly parallel search allowing complex mixtures of RNA and DNA to be interrogated in a high throughput and quantitative fashion. DNA arrays can be used for many different purposes, but predominantly they measure levels of gene expression (messenger RNA abundance) for tens of thousands of genes simultaneously (D. J. Lockhart and E. A. Winzeler, Genomics, Gene Expression and DNA Arrays, Nature, vol. 405, 2000, pp. 827-836). Chips with hundreds or thousands of oligonucleotide sequences representing partial gene sequences may be constructed. Hybridizing mRNA derived from different samples, for example, cancerous versus normal tissue, provides information about gene expression under different cellular conditions. Gene function may be inferred by correlating differential mRNA expression patterns.

If a gene has no previous functional assignments, one can give it a tentative assignment or a role in a biological process based on the known functions of genes in the same expression cluster (the “guilt-by-association” concept). This is possible because genes with similar expression behavior (for example, parallel increases and decreases under similar circumstances) tend to be related functionally.
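The "guilt-by-association" idea can be illustrated by finding, for an unannotated gene, the annotated gene whose expression profile is most strongly correlated with it. The sketch below assumes Pearson correlation as the similarity measure, which the specification does not prescribe; the names `pearson` and `most_coexpressed` and the expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a flat profile carries no co-regulation signal
    return cov / (norm_x * norm_y)


def most_coexpressed(query, profiles):
    """Return the gene (other than the query) whose profile correlates best with the query."""
    candidates = [g for g in profiles if g != query]
    return max(candidates, key=lambda g: pearson(profiles[query], profiles[g]))


# Hypothetical expression levels across four conditions.
profiles = {
    "unknown": [1.0, 2.0, 3.0, 4.0],
    "gene_A": [2.0, 4.1, 6.0, 8.2],   # rises in parallel with the unknown gene
    "gene_B": [4.0, 3.0, 2.0, 1.0],   # moves in the opposite direction
}
partner = most_coexpressed("unknown", profiles)
```

Under these assumptions the unknown gene would tentatively inherit the annotation of `gene_A`, subject to experimental confirmation.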

Collaboratively, the National Cancer Institute (NCI) and Stanford University have tested the expression of 8,000 unique genes in 60 cell lines used in NCI's anticancer drug screening. The Stanford microarray depository website:

(http://genome-www5.stanford.edu/cgi-bin/SMD/listMicroArrayData.pl?tableName=publication&5306)

may be used as a data source. The program is implemented by PERL LWP as a Unix terminal command-based script that has been scheduled to run every month for data update. The input of the program is the Stanford microarray database URL and output is the parsed ASCII-text based flat file. The program uses the GET method from the HTTP protocol to fetch the HTML-format data from the above web site, then parses the data into an ASCII-text based flat file and uses the hard disk as its secondary persistent data storage.

Domain analysis. Gene domain analysis is based on a hidden Markov model search (hmmsearch, HMMER 2.0 suite) against the Pfam model set of 2,773 domains downloaded from http://pfam.wustl.edu with an E-value cutoff of 0.0005. A multiple-thread Perl program is implemented for the domain analysis. It takes a FASTA format protein file as the input and invokes a system call to trigger the Pfam search for each protein sequence. It then parses the raw output to a hash data structure which stores domain name, e-value, alignment position/gap for each protein. The hash data structure is persisted to the Unix file system.

Protein links based on multiple genomes. HTS allows an increasing number of genomes to be sequenced. Given the assumption that genes present in the genomes of multiple species share similar evolutionary histories (phylogeny) and might therefore share similar functions, Eisenberg and colleagues proposed to infer potential protein-protein links from genes with similar phylogenetic profiles (D. Eisenberg et al., Protein Function in the Post-Genomic Era, Nature, vol. 405, 2000, pp. 823-826). Approximately 50 completed genomes from the Institute for Genome Research and the Sanger Center were used to generate phylogenetic profiles (a vector with length of 50) for each gene. The vector value is the actual E-value obtained from the Basic Local Alignment Search Tool (BLAST) search.

c) Data Characterization

Separate individual operational relational databases are constructed to store the raw information from each of the above dimensions. There is no limit to the number of sources or dimensions which may be relied upon, although typically from five to twenty, preferably six to ten, are used. These databases provide the conventional query and search based on a single type of data source.

In addition to the raw data storage, each source is also converted into a numerical matrix. The result is what is called a characteristic matrix. Rows stand for genes of interest, and columns represent a particular attribute within that dimension. The element of the matrix is the value of each gene fit to that attribute.

The numerical matrix may be generated for each individual data source as follows.

For the functional assay, the row is the gene and the column is the assay type/name. The matrix element is the degree of inhibition or activation for each functional assay.

Specific protein-protein interactions are shown as a blue color assay in the YTH screening experiment. The binding affinity degree is represented as one of four levels: strong, medium, weak, and none, corresponding to 1, 0.8, 0.6 and 0 in the binding matrix. The level of protein-protein interaction extracted from the proteomics experiment is based on eight internal experimental parameters. The row is the bait gene, the column is the hit gene. The element is the binding affinity described above.
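As a minimal sketch of the YTH binding matrix just described, the four affinity levels can be mapped to the stated values 1, 0.8, 0.6 and 0, with rows indexed by bait gene and columns by hit gene. The function name `build_binding_matrix`, the tuple layout of the observations, and the gene labels are assumptions made for illustration.

```python
# Numerical values for the four binding levels, as stated in the text.
AFFINITY_SCORE = {"strong": 1.0, "medium": 0.8, "weak": 0.6, "none": 0.0}


def build_binding_matrix(baits, hits, observations):
    """Return a bait-by-hit matrix of binding affinity scores.

    observations is an iterable of (bait, hit, level) tuples; pairs that were
    never observed default to 0.0, i.e. the "none" level.
    """
    row = {gene: i for i, gene in enumerate(baits)}
    col = {gene: j for j, gene in enumerate(hits)}
    matrix = [[0.0] * len(hits) for _ in baits]
    for bait, hit, level in observations:
        matrix[row[bait]][col[hit]] = AFFINITY_SCORE[level]
    return matrix


# Hypothetical screening results.
baits = ["bait_1", "bait_2"]
hits = ["hit_1", "hit_2", "hit_3"]
observations = [
    ("bait_1", "hit_1", "strong"),
    ("bait_1", "hit_3", "weak"),
    ("bait_2", "hit_2", "medium"),
]
binding = build_binding_matrix(baits, hits, observations)
```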

For ontology mapping, a high-dimension word space is constructed by extracting the description words for each GO node and merging them with the key words parsed from the Medline title and abstract of each gene. Each unique key word is one dimension of the word space. The distances between each gene and the GO nodes are calculated and ranked by the Euclidean distance in the word space. The gene's definitive GO mapping nodes are either manually selected, or the top five ranked GO nodes are chosen to represent the relationship between the gene of interest and the GO vocabulary. The row of the matrix is the gene, the column is the unique description word from GO and the element is the mapping described above.
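The word-space ranking can be sketched as follows. The specification does not state how words are encoded numerically, so this sketch assumes a simple presence/absence (0/1) encoding; the names `to_vector` and `rank_go_nodes`, the node labels, and the key words are hypothetical.

```python
import math


def to_vector(words, vocabulary):
    """Encode a word list as a 0/1 vector over the shared vocabulary (assumed encoding)."""
    present = set(words)
    return [1.0 if word in present else 0.0 for word in vocabulary]


def rank_go_nodes(gene_words, go_descriptions, top_n=5):
    """Rank GO nodes by Euclidean distance to the gene in word space; return the closest top_n."""
    # Each unique key word is one dimension of the word space.
    vocabulary = sorted(set(gene_words).union(*go_descriptions.values()))
    gene_vector = to_vector(gene_words, vocabulary)
    scored = []
    for node, words in go_descriptions.items():
        distance = math.dist(gene_vector, to_vector(words, vocabulary))
        scored.append((distance, node))
    scored.sort()
    return [node for _, node in scored[:top_n]]


# Hypothetical key words for one gene and description words for three GO nodes.
gene_words = ["apoptosis", "caspase", "death"]
go_descriptions = {
    "node_apoptosis": ["apoptosis", "death", "programmed"],
    "node_transport": ["membrane", "transport"],
    "node_kinase": ["kinase", "phosphorylation", "caspase"],
}
ranked = rank_go_nodes(gene_words, go_descriptions, top_n=2)
```

The default of five nodes follows the text; a smaller `top_n` is used here only because the toy vocabulary has three nodes.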

For fold recognition, the row of the matrix is the gene and the column is a protein with a crystal structure. The element is zero if there is no match between the gene of interest and the crystal structure; otherwise it is the confidence score output by the neural network described in the previous section.

For the co-regulation value from the expression array, the row of the matrix is the gene and the column is the condition of the experiment performed. The element is the logarithm of the ratio of expression levels of a particular gene under two different experimental conditions.
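A minimal sketch of the expression matrix follows. The base of the logarithm is not given in the text; base 2 is assumed here because it is a common convention for expression ratios. The function name `log_ratio_matrix`, the pairing of each measurement with a reference level, and the sample values are hypothetical.

```python
import math


def log_ratio_matrix(genes, conditions, expression):
    """Return a gene-by-condition matrix of log2(test level / reference level).

    expression maps (gene, condition) to a (test, reference) pair of positive levels.
    """
    matrix = []
    for gene in genes:
        row = []
        for condition in conditions:
            test, reference = expression[(gene, condition)]
            if test <= 0 or reference <= 0:
                raise ValueError("expression levels must be positive")
            row.append(math.log2(test / reference))  # base 2 is an assumption
        matrix.append(row)
    return matrix


# Hypothetical expression levels (test, reference).
genes = ["gene_1", "gene_2"]
conditions = ["condition_A", "condition_B"]
expression = {
    ("gene_1", "condition_A"): (400.0, 100.0),
    ("gene_1", "condition_B"): (50.0, 100.0),
    ("gene_2", "condition_A"): (100.0, 100.0),
    ("gene_2", "condition_B"): (800.0, 100.0),
}
coregulation = log_ratio_matrix(genes, conditions, expression)
```

With a logarithmic scale, a fourfold increase and a fourfold decrease have the same magnitude and opposite sign, which keeps induction and repression symmetric in later distance calculations.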

For domain analysis, the row of the matrix is the gene and the column is the domain name obtained from Pfam. The element is the negative logarithm of the E-value obtained from the HMM search.
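The domain matrix can be sketched as below, taking already parsed search hits as input rather than raw HMMER output. Base 10 is assumed for the logarithm, and a small floor is assumed so that an E-value reported as 0 does not produce an infinite score; neither detail is specified in the text. The 0.0005 cutoff is the one quoted in the data collection section. All names and sample hits are hypothetical.

```python
import math

E_VALUE_CUTOFF = 0.0005  # cutoff quoted for the Pfam search
E_VALUE_FLOOR = 1e-300   # assumption: avoids log(0) for E-values reported as 0


def domain_matrix(genes, domains, hits):
    """Return a gene-by-domain matrix of -log10(E-value) for hits passing the cutoff.

    hits is an iterable of (gene, domain, e_value) tuples; missing pairs stay at 0.0.
    """
    row = {gene: i for i, gene in enumerate(genes)}
    col = {domain: j for j, domain in enumerate(domains)}
    matrix = [[0.0] * len(domains) for _ in genes]
    for gene, domain, e_value in hits:
        if e_value > E_VALUE_CUTOFF:
            continue  # not significant enough to record
        score = -math.log10(max(e_value, E_VALUE_FLOOR))
        i, j = row[gene], col[domain]
        matrix[i][j] = max(matrix[i][j], score)  # keep the best hit per pair
    return matrix


# Hypothetical parsed search results.
genes = ["gene_1", "gene_2"]
domains = ["domain_A", "domain_B"]
hits = [
    ("gene_1", "domain_A", 1e-10),
    ("gene_2", "domain_B", 1e-4),
    ("gene_2", "domain_A", 0.01),  # fails the cutoff and is ignored
]
domain_scores = domain_matrix(genes, domains, hits)
```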

For protein links based on multiple genomes, the row of the matrix is the gene and the column is the name of the complete genome. The element is the negative logarithm of the lowest E-value obtained from the BLAST search queried by the gene against that particular complete genome.
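The E-value transform used for the domain and multiple-genome matrices might be sketched as follows. The logarithm base is not stated in the text, so base 10 is an assumption here, as are the clamping floor for an E-value of exactly zero, the use of 0.0 for genomes without a hit, and all function names.

```python
import math

def neg_log_evalue(evalue, floor=1e-300):
    """Negative base-10 logarithm of an E-value.

    Assumption: values below `floor` are clamped so that an E-value of 0.0
    still yields a finite matrix element.
    """
    return -math.log10(max(evalue, floor))

def genome_profile(best_evalues, genomes):
    """Build one matrix row with an entry per complete genome.

    best_evalues: dict mapping genome name -> lowest E-value for this gene.
    Assumption: genomes with no reported hit get an element of 0.0.
    """
    return [
        neg_log_evalue(best_evalues[g]) if g in best_evalues else 0.0
        for g in genomes
    ]
```

Smaller E-values therefore map to larger matrix elements, so stronger matches carry more weight in the gene's profile.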

Each characteristic matrix may be processed individually, or the matrices may be concatenated into a larger matrix with multiple metric measurements as part of the multidimensional analysis. Optionally, each matrix may be weighted using a variety of techniques. In one embodiment, each data source is equally weighted, i.e., with a weight of 1. In another embodiment, based on prior knowledge, the weight is decreased for data sources with more false positives or less biological significance and/or increased for more reliable or significant data. In a third embodiment, Bayesian statistics may be used to calculate the conditional posterior probability of the weight for each source based on the likelihood of observing the data given the data source, as follows:

$P\left( S \middle| D \right) = \frac{P\left( D \middle| S \right)\, P(S)}{\sum\limits_{s}{P\left( D \middle| s \right)\, P(s)}}$

where S is the data source and D is the characteristic value of each gene for that particular data source. P(D|S) is the likelihood of observing the data given this particular data source, and P(S) is the prior probability of the data source, which can be uniformly distributed in the beginning.
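As an illustration of the third weighting embodiment, the sketch below normalizes likelihood-times-prior over all sources and then applies the resulting weights while concatenating the matrices. How the likelihoods are estimated is not described in the text, so they are taken as given inputs; the function names and source labels are hypothetical.

```python
def posterior_source_weights(likelihoods, priors=None):
    """Bayes' rule: P(S|D) = P(D|S) P(S) / sum over s of P(D|s) P(s).

    likelihoods: dict mapping source name -> P(D|S).
    priors: dict mapping source name -> P(S); uniform when omitted.
    """
    sources = list(likelihoods)
    if priors is None:
        priors = {s: 1.0 / len(sources) for s in sources}
    joint = {s: likelihoods[s] * priors[s] for s in sources}
    evidence = sum(joint.values())  # the denominator of Bayes' rule
    return {s: joint[s] / evidence for s in sources}

def concatenate_weighted(matrices, weights):
    """Scale each source matrix by its weight and join the rows column-wise.

    matrices: dict mapping source name -> list of rows (same gene order in each).
    """
    names = list(matrices)
    n_genes = len(matrices[names[0]])
    combined = []
    for g in range(n_genes):
        row = []
        for name in names:
            row.extend(weights[name] * value for value in matrices[name][g])
        combined.append(row)
    return combined
```

Setting every weight to 1 reproduces the equally weighted embodiment, and hand-chosen weights reproduce the prior-knowledge embodiment.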

d) Data Integration

The data warehouse-based infrastructure supports bioinformatic analysis of potential molecular therapeutic targets. Unlike traditional relational database management systems (RDBMSs), which are optimized for transactional database processing, the data warehouse of this invention is an integrated, time-variant, nonvolatile collection of data from various operational databases. In this bioinformatics system, operational relational databases contain data generated from specific methodologies, as discussed above.

A gene-centric analysis of the data warehouse allows large-scale data integration, relationship learning, and decision-making based on daily updated operational relational databases. The entire system serves two primary purposes:

-   It organizes existing data to facilitate complex queries.
-   It infers relationships based on the stored data and subsequently predicts missing attribute values for incoming information based on multidimensional data.

Original data from a plurality of dimensions, or independent sources, is entered (i.e., parsed) into operational databases and updated daily. The operational databases are based on the relational database system and are tuned to support large-scale and frequent transactions. Data analysis engines extract, clean, and analyze the data from these databases and load the analyzed data and metadata into the data warehouse. The core data warehouse system design is based on the star schema of Oracle 8i, which implements multidimensional databases.

Data marts (extensions of the data warehouse) support different query requests. Data marts derive data from the central data warehouse in response to different inquiries (i.e., queries). A detailed-summary system is a data mart built on the traditional RDBMS design with fixed metadata tables to summarize and aggregate raw data. Another type of data mart is based on a multidimensional database design. The latter offers superior representations of diverse data views, which it obtains by comparing various aspects of the analysis environment at different detail levels.

The star schema uses de-normalized storage to provide data views from individual or multiple dimensions with high efficiency. It offers multidimensional solutions that analyze large amounts of data with very fast response times, "slices and dices" through the data, and drills down or rolls up through various dimensions defined by the data structure. It is also easy to scale.

In order to easily visualize data from multiple views, the system is limited to a 3D data cube. In one example, genes discovered in the functional assays of two projects, a T and B lymphocyte cell project and a cell-cycle project, are used to construct a data cube as a function of time and each gene's domains. As FIG. 2 shows, a cube slice represents a data view as a function of the various dimensions. Regular normalized RDBMS design requires expensive multiple table join operations to extract the information in the data cube, whereas the multidimensional database exploits the schema design to facilitate a faster and more reliable response.
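The slice and roll-up operations on a gene, time, and domain cube can be illustrated with a small in-memory sketch. This is not the Oracle-based implementation described above; the record layout, the labels, and the choice of summation as the aggregate are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical de-normalized fact records: (gene, time point, domain, value).
facts = [
    ("geneA", "t1", "domain1", 1.0),
    ("geneA", "t2", "domain1", 2.0),
    ("geneB", "t1", "domain2", 3.0),
    ("geneB", "t2", "domain2", 4.0),
]

def slice_cube(records, time):
    """Slice: fix the time dimension and return a gene-by-domain view."""
    return {(gene, domain): value for gene, t, domain, value in records if t == time}

def roll_up(records):
    """Roll up: aggregate over time and domain to get one total per gene."""
    totals = defaultdict(float)
    for gene, _, _, value in records:
        totals[gene] += value
    return dict(totals)
```

Because every record already carries all three dimension keys, neither operation needs a join, which is the property the de-normalized design relies on.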

e) Data Analysis and Relationship Inference

There are numerous methods known in the art to explore relationship extraction and inference deduction. In one embodiment, the system of this invention uses cluster analysis, a multivariate analysis technique that seeks to organize information about variables to form relatively homogeneous groups. Some common cluster analysis methods are hierarchical clustering, k-means, self-organizing maps, and support vector machines. All of these methods employ distance functions to compare raw data and recognize grouping characteristics.

Hierarchical clustering may be applied to the characteristic matrices. Hierarchical clustering offers flexibility in determining the exact cluster number and statistical assessment of members' relatedness along the branching tree (M. B. Eisen et al., Cluster Analysis and Display of Genome-Wide Expression Patterns, Proc. Nat'l Academy of Science, U.S.A. 95, Nat'l Academy of Sciences, Washington D.C., 1998, pp. 14,863-14,868). The distance matrix is based on the Pearson correlation coefficient between any two genes X and Y from the original characteristic matrix:

$\begin{matrix}{\rho_{X,Y} = \frac{1}{N}\sum\limits_{i = 1}^{N}\left( \frac{X_{i} - \bar{X}}{\sigma_{X}} \right)\left( \frac{Y_{i} - \bar{Y}}{\sigma_{Y}} \right)} & (1)\end{matrix}$

where N is the characteristic matrix dimension in the attribute direction, and σ is the standard deviation of the gene attribute:

$\begin{matrix}{\sigma_{A} = \sqrt{\sum\limits_{i = 1}^{N}\frac{\left( A_{i} - \bar{A} \right)^{2}}{N}}} & (2)\end{matrix}$
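Equations (1) and (2) translate directly into code. The sketch below is one possible rendering in plain Python; the function names are hypothetical, and a gene whose attributes are all identical (zero standard deviation) is not handled and would raise a division error.

```python
import math

def population_std(values):
    """Equation (2): standard deviation with N (not N - 1) in the denominator."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def pearson(x, y):
    """Equation (1): mean product of the standardized attribute values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x, sd_y = population_std(x), population_std(y)
    return sum(
        ((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y)
    ) / n
```

Two genes whose rows rise and fall together across the attributes score close to 1, and genes with opposite profiles score close to -1.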

Clustering is achieved by recursively joining the two elements with the highest Pearson correlation coefficient in the upper-diagonal distance matrix until the distance matrix dimension reduces to one in the direction that the joining is performed. This process leads to clustering of genes with a similar binding vector profile. A tree-based dendrogram is produced with end nodes of relatively high correlation to each other, and nodes in the upper branch with less correlation to each other. The clustering algorithm uses an unsupervised hierarchical clustering that discovers common characteristics among genes without prior knowledge of them. As knowledge accumulates, it may be helpful to use a set of truly related genes as the training set for a relationship inference model.
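The recursive joining procedure can be sketched as a simple agglomerative loop. The text does not say how a joined node is represented in later rounds, so the size-weighted mean profile used here is an assumption, as are the function names; the loop is written for clarity rather than speed.

```python
import math

def pearson(x, y):
    """Pearson correlation using population standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)

def hierarchical_cluster(profiles):
    """Repeatedly join the most correlated pair until one cluster remains.

    profiles: dict mapping gene name -> attribute vector.
    Returns the merge history as (members_a, members_b, correlation) tuples,
    which is enough to draw a dendrogram.
    """
    # Each cluster is (tuple of member names, representative profile).
    clusters = [((name,), list(vec)) for name, vec in profiles.items()]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):  # upper triangle only
                r = pearson(clusters[i][1], clusters[j][1])
                if best is None or r > best[0]:
                    best = (r, i, j)
        r, i, j = best
        (names_i, vec_i), (names_j, vec_j) = clusters[i], clusters[j]
        size_i, size_j = len(names_i), len(names_j)
        # Assumption: the joined node is the size-weighted mean of its parts.
        merged = [
            (a * size_i + b * size_j) / (size_i + size_j)
            for a, b in zip(vec_i, vec_j)
        ]
        merges.append((names_i, names_j, r))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((names_i + names_j, merged))
    return merges
```

Early entries in the merge history correspond to the highly correlated end nodes of the dendrogram, and later entries correspond to the less correlated upper branches.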

Support vector machines (SVMs), a class of supervised clustering algorithms, have been successfully applied to functional classification when combining data from both array expression experiments and phylogenetic profiles (P. Pavlidis et al., Gene Functional Classification from Heterogeneous Data, Proc. 5th Int'l Conf. Computational Molecular Biology, ACM Press, New York, 2001, pp. 242-248). An SVM projects the original data into a higher dimensional space, the feature space, and defines a separating hyperplane to discriminate class members from nonmembers, an operation difficult to perform in the original space. An SVM does not require all clusters to have spherical contours, an underlying assumption of many unsupervised clustering algorithms, such as k-means.

For supervised learning, a support vector machine may be used with a normalized inner product raised to a certain power (n = 1, 2, 3):

$K\left( \vec{X},\vec{Y} \right) = \left( \frac{\vec{X} \cdot \vec{Y}}{\sqrt{\vec{X} \cdot \vec{X}}\sqrt{\vec{Y} \cdot \vec{Y}}} + 1 \right)^{n}$

or with a radial basis kernel and a 2-norm soft margin:

$K\left( \vec{X},\vec{Y} \right) = \exp\left( - \frac{\left\| \vec{X} - \vec{Y} \right\|^{2}}{2\sigma^{2}} \right)$

where $\vec{X}$ is the vector from the matrix that describes gene X.
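The two kernels described above can be written out as plain functions. This sketch covers only the kernel evaluations, not SVM training or the soft margin; the function names and the default value of sigma are assumptions.

```python
import math

def dot(x, y):
    """Inner product of two gene vectors."""
    return sum(a * b for a, b in zip(x, y))

def normalized_polynomial_kernel(x, y, n=1):
    """((x . y) / (|x| |y|) + 1) ** n: shifted cosine similarity raised to n."""
    cosine = dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))
    return (cosine + 1.0) ** n

def radial_basis_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))
```

Normalizing the inner product makes the polynomial kernel depend only on the angle between two gene profiles, so genes with many attributes do not dominate simply because their vectors are longer.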

For supervised learning, feature selection is performed to extract the genes with the highest F score defined by the Fisher criterion to make the classification more accurate and informative:

$F(j) = \frac{\left( \mu_{j}^{+} - \mu_{j}^{-} \right)^{2}}{\left( \sigma_{j}^{+} \right)^{2} + \left( \sigma_{j}^{-} \right)^{2}}$

where $\mu_{j}^{+}$ and $\sigma_{j}^{+}$ are the mean and standard deviation of feature j across the positive examples, and $\mu_{j}^{-}$ and $\sigma_{j}^{-}$ are the corresponding values across the negative examples.
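A short sketch of the Fisher score for a single feature follows. It takes the denominator as the sum of the positive-class and negative-class variances, consistent with the definitions of σ⁺ and σ⁻ given above, and it assumes the population form of the standard deviation; the function names are hypothetical.

```python
import math

def mean_and_std(values):
    """Mean and population standard deviation of one feature in one class."""
    mean = sum(values) / len(values)
    return mean, math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def fisher_score(positive_values, negative_values):
    """Squared difference of class means over the sum of class variances."""
    mu_pos, sd_pos = mean_and_std(positive_values)
    mu_neg, sd_neg = mean_and_std(negative_values)
    return (mu_pos - mu_neg) ** 2 / (sd_pos ** 2 + sd_neg ** 2)
```

A feature scores highly when the two classes have well-separated means and little spread within each class, which is why ranking by this score favors the most discriminative attributes.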

After clustering or other classification, genes with similar vector descriptions within the matrix may be correlated as demonstrated in FIGS. 5 and 6. Genes and proteins with similar profiles based on the various biological experiments and observations described above are more likely to have similar cellular functions and be involved in the same biological pathway.

Genes and proteins having similar functions or participating in the same biological pathway are clustered/classified together. Therefore, functions of novel genes and proteins can be inferred based on functions of the known genes or proteins in the same cluster.
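One simple way to carry out this kind of inference is a consensus vote within each cluster, sketched below. The text does not prescribe a specific rule, so the majority vote, the function name, and the input layout are all assumptions for illustration.

```python
from collections import Counter

def infer_functions(clusters, known_functions):
    """Assign each unannotated gene the most common annotation in its cluster.

    clusters: list of lists of gene names.
    known_functions: dict mapping annotated gene -> function label.
    """
    inferred = {}
    for members in clusters:
        labels = Counter(
            known_functions[g] for g in members if g in known_functions
        )
        if not labels:
            continue  # nothing known in this cluster, so nothing is inferred
        consensus = labels.most_common(1)[0][0]
        for gene in members:
            if gene not in known_functions:
                inferred[gene] = consensus
    return inferred
```

Predictions made this way are hypotheses for follow-up experiments rather than confirmed annotations.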

The system effectively blends all available information to establish functional linkages among proteins. This information may be used to identify genes that are likely drug target candidates. Small molecule-based high throughput screening (HTS) of the validated targets may be used to identify lead compounds. The lead compounds may then be subjected to downstream biochemical and cell-based assays for optimization. After confirming the functional specificity and activity of the optimized lead compounds, their drug effects may be further characterized in animal models and preclinical studies.

EXAMPLE

Clustering methods were used to analyze genes from YTH screening. The original data set contained about 300 baits and 2,300 hits. A total of 391 baits and hits with multiple interactions were selected for the analysis shown in FIG. 3. FIG. 4 shows the protein-protein interaction network prior to gene clustering.

After clustering the first characteristic matrix, proteins that have similar functions or participate in the same pathway were grouped. As shown in FIG. 5, FLAME1-γ (AF009618), Homo sapiens FLAME1 mRNA (AF009616), MRIT-α-1 (U85059), I-FLICE (AF041458), and CLARP (AF005774.1) all encode the same protein, which is commonly referred to as FLIP. FLIP is a death-domain-containing anti-apoptotic molecule that regulates Fas/TNFR1-induced apoptosis.

Another group close to FLIP in the hierarchical tree contains TRAF1 (TNF receptor associated factor 1, NM_005658), TRAF2 (U12597), TANK (TRAF family member associated NFKB activator, XP_002533.1), receptor interacting protein (RIP), RIP-like kinase (AF156884), BCL-2-associated athanogene (XP_005538.1), protein phosphatase 2 regulatory subunit B (NP_006234.1), and an unknown gene (BAB25712.1).

The literature has shown that TNF receptors lack intrinsic catalytic activity. Death-domain-containing proteins (RIP) and TRAF-domain-containing proteins (TRAF1, TRAF2, TRAF3) bridge TNF receptors to several downstream signaling pathways. This bridging causes diverse cellular responses including cellular proliferation, differentiation, effector functioning, and apoptosis. Protein phosphatase 2A (PP2A) affects a variety of biological events including apoptosis.

BAD is a pro-apoptotic member of the BCL2 family of proteins. PP2A can dephosphorylate BAD, resulting in apoptosis. BCL-2-associated athanogene (BAG-1) is a heat shock protein 70 (Hsp70)-binding protein that can collaborate with BCL-2 to enhance the anti-apoptotic activity of BCL-2.

The literature confirms that these proteins are all involved in apoptotic and anti-apoptotic signaling events initiated by TNF and BCL-2. The unknown protein BAB25712.1 might therefore play a role in TNF or BCL-2 signaling pathways. Our system placed FLIP in the vicinity of RIP and the BCL-2-associated chaperone BAG-1. Several independent studies support this protein linkage assignment. FLIP can modulate the NF-kappaB pathway and physically interacts with several signaling proteins, such as the TRAFs and RIP. FLIP can also interact with a BCL-2 family member.

FIG. 6 shows the significant refinement in results when HMM domain search information was incorporated as an additional dimension. First, more proteins involved in TNF receptor-induced apoptosis join the group. Both CAP-1 (CD40-associated protein, L38509) and CRAF-1 (CD40 receptor-associated factor 1, U21092) sequences are essentially identical to TRAF3 by BLAST analysis (in other words, TRAF3 is included in this pathway).

TRAF3, which contains a conserved TRAF domain, binds to the intracellular domain of CD40 (a member of the TNF receptor family). Two other TRAF family members, TRAF4 and TRAF6, were included in this group. ILP (IAP-like protein, U32974), MIHC (human homolog of IAP C, U37546), and p73 (AF079094) were also found to be new members.

Baculovirus inhibitors of apoptosis (IAPs) can prevent insect cell death. Both ILP and MIHC are human homologs of IAP. Rothe and colleagues have shown that interactions of MIHC with TRAF1 and TRAF2 inhibit apoptosis. Similarly, ILP can regulate cell death. p73 is a p53-like tumor-suppressor protein which regulates the cell cycle and apoptosis. Direct interaction of p73 with IAP-like proteins or TRAFs has not been reported, however. The inference analysis suggests that the apoptosis pathways of p73 and IAP-like proteins intersect or even associate physically.

We excluded the genes specific to the BCL-2 apoptosis pathway (PP2A B subunit and BCL-2-associated athanogene), as the current apoptotic pathway is more TNF specific. Finally, we also excluded two related proteins (RIP and RIPH) after we added domain information. We did this to weight the YTH interaction pattern and domain information equally. When we tuned the domain weight up to five times the weight of YTH, RIP and RIPH were retained, and we also introduced some unrelated proteins.

Although specific embodiments of the invention have been described, various modifications, alterations, alternative constructions, and equivalents are also encompassed within the scope of the invention. The described invention is not restricted to operation within certain specific data processing environments, but is free to operate within a plurality of data processing environments. Additionally, although the present invention has been described using a particular series of transactions and steps, it should be apparent to those skilled in the art that the scope of the present invention is not limited to the described series of transactions and steps.

Further, while the present invention has been described using a particular combination of hardware and software, it should be recognized that other combinations of hardware and software are also within the scope of the present invention. The present invention may be implemented only in hardware, or only in software, or using combinations thereof.

The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that additions, subtractions, deletions, and other modifications and changes may be made thereunto without departing from the broader spirit and scope of the invention as set forth in the claims.

1. A method for identifying biomolecule functions and relationships via the manipulation of biological data from a plurality of sources, comprising: collecting biological data from a plurality of sources, wherein a source contains different attributes of biological data than another source; calculating for each source a characteristic matrix from the biological data for that source, wherein each characteristic matrix has a first dimension representing biomolecules and a second dimension representing attributes from the corresponding source, wherein each matrix element has a value for a specific biomolecule fit to a specific attribute; combining the characteristic matrices to create a combined matrix of higher dimension than each of the characteristic matrices, wherein combining includes calculating a weighting coefficient corresponding to each of the characteristic matrices; and analyzing the combined matrix to determine at least one of biomolecule functions and relationships, wherein analyzing comprises clustering the biomolecules based on global similarity criteria between the biomolecules, wherein a global similarity criterion between two biomolecules is composed from similarity criteria between the values of each attribute of each source for the two biomolecules.
2. The method of claim 1, wherein the at least one of biomolecule functions and relationships includes protein-protein interactions.
3. The method of claim 1, wherein clustering comprises performing an unsupervised clustering analysis.
4. The method of claim 1, wherein clustering comprises using support vector machines or neural nets.
5. The method of claim 1, wherein a global similarity criterion is the distance from one biomolecule to another biomolecule in the higher dimensional space of the combined matrix.
6. The method of claim 1, wherein a global similarity criterion utilizes a Pearson correlation coefficient relating one biomolecule to another biomolecule in the higher dimensional space of the combined matrix.
7. The method of claim 1, wherein the sources comprise at least one source from a plurality of screening experiments and at least one source from a plurality of public or proprietary archival resources.
8. The method of claim 7, wherein the plurality of screening experiments comprise high throughput screening, yeast two-hybrid screening, protein array analyses, nucleic acid array analyses, siRNA analyses, and functional gene screening.
9. The method of claim 7, wherein biological data from the plurality of public or proprietary archival resources comprise domain analyses, ontology vocabulary mapping, fold recognition, gene sequences, protein sequences, expressed sequence tags, single nucleotide polymorphisms, biochemical functions, physiological roles and structure/function relationships from web-based or conventional published literature.
10. The method of claim 1, wherein a weighting coefficient is determined using Bayesian statistics.
11. The method of claim 1, further comprising identifying potential therapeutic targets based on the analysis.
12. The method of claim 1 wherein combining includes: for each characteristic matrix, multiplying the matrix elements of that characteristic matrix by the weighting coefficient for that characteristic matrix, wherein the weighting coefficients for two of the characteristic matrices are different.
13. A computer readable storage medium having a plurality of instructions to direct a processing device to perform an operation for identifying biomolecule functions and relationships via the manipulation of biological data from a plurality of sources, the operation comprising the steps of: collecting biological data from a plurality of sources; calculating for each source a characteristic matrix from the biological data for that source, wherein each characteristic matrix has a first dimension representing biomolecules and has a second dimension representing attributes from the corresponding source, wherein each matrix element has a value for a specific biomolecule fit to a specific attribute; combining the characteristic matrices to create a combined matrix of higher dimension than each of the characteristic matrices, wherein combining includes calculating a weighting coefficient corresponding to each of the characteristic matrices; and analyzing the combined matrix to determine at least one of biomolecule functions and relationships, wherein the analyzing comprises clustering the biomolecules based on global similarity criteria between the biomolecules, wherein a global similarity criterion between two biomolecules is composed from similarity criteria between the values of each attribute of each source for the two biomolecules.
14. The computer readable storage medium of claim 13 wherein combining includes: for each characteristic matrix, multiplying the matrix elements of that characteristic matrix by the weighting coefficient for that characteristic matrix, wherein the weighting coefficients for two of the characteristic matrices are different.
15. A method for identifying biomolecule functions and relationships via the manipulation of biological data from a plurality of sources, comprising: collecting biological data from a plurality of sources, wherein a source contains different attributes of biological data than another source, wherein the biological data from each source contains a list of biomolecules, a list of attributes from that source, and a value for each biomolecule for each attribute from that source; and clustering the biomolecules according to the values for the attributes from the plurality of sources to determine at least one of biomolecule functions and relationships, wherein clustering the biomolecules is based on global similarity criteria between the biomolecules, wherein a global similarity criterion between two biomolecules is composed from similarity criteria between the values of each attribute of each source for the two biomolecules, wherein a global similarity criterion utilizes different weighting coefficients for values of the attributes of two of the sources.
16. The method of claim 15, wherein a global similarity criterion is a distance from one biomolecule to another biomolecule, wherein the distance is calculated from a difference between the values of each attribute of each source for the two biomolecules.
17. The method of claim 15, wherein a global similarity criterion utilizes a Pearson correlation coefficient relating one biomolecule to another biomolecule by using the values of each attribute of each source for the two biomolecules.
18. The method of claim 15 wherein a lower weighting coefficient is used for the values of the attributes from a source with higher false positives than for the values of the attributes from a source with lower false positives.